TABLE 5.2
Quantization results for BERT-base on SST-2. Results are obtained with 128 groups in
each layer.

Method        w-bits     e-bits   Acc     Size    Size-w/o-e
Baseline      32         32       93.00   415.4   324.5
Q-BERT        8          8        92.88   103.9   81.2
DirectQ       4          8        85.67   63.4    40.6
Q-BERT        4          8        92.66   63.4    40.6
DirectQ       3          8        82.86   53.2    30.5
Q-BERT        3          8        92.54   53.2    30.5
Q-BERT(MP)    2/4(MP)    8        92.55   53.2    30.5
DirectQ       2          8        80.62   43.1    20.4
Q-BERT        2          8        84.63   43.1    20.4
Q-BERT(MP)    2/3(MP)    8        92.08   48.1    25.4
Note: The quantization bits used for weights are abbreviated as “w-bits,” those for embeddings as “e-bits,” model size in MB as “Size,” and model size without the embedding layer in MB as “Size-w/o-e.” For simplicity and efficacy, all models except the Baseline use 8-bit activations. Here “MP” refers to mixed-precision quantization.
(number of heads) value matrices Wv are concatenated together, resulting in a 3-d tensor.
For layer-wise quantization, as shown in Fig. 5.6(a), the entire 3-d tensor is quantized
into the same range of discrete numbers. A special case of group-wise quantization treats
each dense matrix as a group, so that every matrix can have its own quantization range, as
shown in Fig. 5.6(b). The more general case in Fig. 5.6(c) further partitions each dense
matrix along its output neurons, so that every consecutive d/(2Nh) output neurons are
bucketed as a group.
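To illustrate the difference between layer-wise and group-wise quantization, the sketch below gives each group of output neurons (rows of a weight matrix) its own quantization range and compares the resulting error against a single shared range. The matrix shape, the 128-group split, and symmetric uniform quantization are assumptions made for this example, not the exact Q-BERT implementation.

```python
import numpy as np

def quantize_sym(x, n_bits):
    """Symmetric uniform quantization with a single range; returns dequantized values."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def layerwise_quant(W, n_bits=4):
    # One quantization range shared by the whole tensor, as in Fig. 5.6(a).
    return quantize_sym(W, n_bits)

def groupwise_quant(W, n_bits=4, n_groups=128):
    # Each group of output neurons (rows) gets its own range, as in Fig. 5.6(c).
    out = np.empty_like(W)
    for rows in np.array_split(np.arange(W.shape[0]), n_groups):
        out[rows] = quantize_sym(W[rows], n_bits)
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((768, 768)).astype(np.float32)  # stand-in for one attention weight matrix

err_layer = np.mean((W - layerwise_quant(W)) ** 2)
err_group = np.mean((W - groupwise_quant(W)) ** 2)
print(err_layer, err_group)  # the finer grouping typically yields lower quantization error
```

Because each group's scale only has to cover the dynamic range of its own neurons, the discrete levels are spent more efficiently, which is why finer groups generally reduce quantization error at the cost of storing more scaling factors.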
The results of Q-BERT on the development set of SST-2 are presented in Table 5.2. SST-2
is a movie review dataset with binary annotations, where the label indicates a positive or
negative review. It can be seen that Q-BERT outperforms the direct-quantization baseline
(DirectQ) by a large margin across various bit precisions.
5.4 I-BERT: Integer-Only BERT Quantization
Kim et al. [118] propose I-BERT to construct an integer-only BERT. Their motivation
comes from the fact that previous quantization schemes for transformer-based language
models use simulated quantization (fake quantization), where all or part of the operations
in inference (e.g., GELU, Softmax, and Layer Normalization) are carried out with floating-
point arithmetic. Such approaches are illustrated on the left side of Fig. 5.4. However, they
are hard to deploy in real edge-application scenarios, where many neural accelerators and
popular edge processors do not support floating-point arithmetic.
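To make the contrast concrete, the following sketch compares the two paths for a single linear layer: a simulated (fake) quantization path that dequantizes the weights and performs the matrix multiplication in floating point, and an integer-only path in which the multiply-accumulate runs entirely in int32 and only a final rescale maps the result back. This is a minimal illustration under assumed 8-bit symmetric per-tensor quantization, not I-BERT's actual kernels.

```python
import numpy as np

def quantize_sym(x, n_bits=8):
    """Symmetric per-tensor quantization: integer values plus a floating-point scale."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)  # toy weight matrix
x = rng.standard_normal(8).astype(np.float32)       # toy input activation

# Simulated (fake) quantization: weights are quantized, then dequantized,
# and the matrix multiply still runs in floating point.
qW, sW = quantize_sym(W)
y_fake = (qW.astype(np.float32) * sW) @ x

# Integer-only path: both operands stay in int32, the accumulation is integer
# arithmetic, and a single rescale maps the result back at the end.
# (I-BERT pushes even this rescale into integer-only dyadic arithmetic.)
qx, sx = quantize_sym(x)
y_int = (qW @ qx) * (sW * sx)

print(np.max(np.abs(y_fake - y_int)))  # both are approximations of W @ x
```

In a true integer-only deployment, the final rescaling itself is carried out with integer (dyadic) arithmetic, so no floating-point unit is required at inference time.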
To solve these challenges, an integer-only quantization for BERT is necessary. Specifically,
the proposed I-BERT incorporates a series of novel integer-only quantization schemes for
transformer-based language models, including new kernels for the efficient and accurate integer-only